Harden WAF pre-flight + breaker edge cases#368
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces Unreleased changelog entry and implements WAF and job-lifecycle reliability fixes: case-insensitive probe scheme detection, Akamai cookie detection, BlockJob CAS guard + fail-fallback, async BlockJob dispatch with re-arm on failure, enqueue terminal short-circuit, and sitemap mid-batch terminal checks. Changes
Sequence Diagram(s)sequenceDiagram
participant WAF as "WAF Observer"
participant Breaker as "WAFCircuitBreaker"
participant Dispatch as "Dispatch Goroutine\n(30s timeout)"
participant JM as "JobManager.BlockJob"
participant DB as "Database"
WAF->>Breaker: observe blocked outcome for jobID
Breaker->>Breaker: mark jobID tripped
Breaker->>Dispatch: spawn detached dispatch (returns immediately)
Dispatch->>JM: call BlockJob(jobID) (detached ctx, 30s)
JM->>DB: UPDATE jobs ... WHERE id=? AND status IN (pre-terminal)
DB-->>JM: RowsAffected (1 => success / 0 => race lost)
alt success
JM-->>Dispatch: success
Dispatch-->>Breaker: no rearm
else failure or timeout
Dispatch-->>Breaker: report failure
Breaker->>Breaker: Rearm(jobID) (clear tripped flag)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Updates to Preview Branch (work/waf-followups-a) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
Release VersionsApp patch: ChangelogFixed
Changed
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/jobs/manager.go`:
- Line 588: The job failure message currently includes raw internal error text
by concatenating err.Error() into the call to jm.failJobWithMessage; replace
that concatenation with a stable, customer-safe message (e.g., "WAF detected but
block transition failed") and log the detailed err separately to the service
logger. Locate the failing call to jm.failJobWithMessage (using job.ID) and
update it to pass only the generic message, then add a nearby log entry (using
the existing logger) that records err and contextual info for debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 067a8b31-595d-4c1e-8dfd-d80dbba86946
📒 Files selected for processing (5)
CHANGELOG.mdinternal/crawler/probe.gointernal/crawler/probe_test.gointernal/jobs/fail_job_message_test.gointernal/jobs/manager.go
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
1 similar comment
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/jobs/waf_circuit_breaker.go`:
- Around line 121-128: Rearm currently only clears the tripped flag so a job
must accumulate threshold more failures before retrying; change Rearm (and the
other similar rearm occurrence) to also seed the per-job counter to threshold-1
and set the last vendor so the next blocked outcome will immediately retrip:
locate WAFCircuitBreaker.Rearm and the duplicate rearm block, acquire the mutex
as now, unset tripped[jobID], then set counts[jobID] = b.threshold - 1 and
lastVendor[jobID] = <previous vendor value stored on trip> (or preserve existing
lastVendor if already set) so Observe() logic will trigger a retrip on the next
blocked event. Ensure you handle nil/empty jobID checks exactly as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7fc90932-6ad5-4227-b716-42b3f36a5ec2
📒 Files selected for processing (4)
CHANGELOG.mdinternal/crawler/waf.gointernal/crawler/waf_test.gointernal/jobs/waf_circuit_breaker.go
|
Note Docstrings generation - SUCCESS |
Docstrings generation was requested by @simonsmallchua. The following files were modified: * `internal/crawler/probe.go` * `internal/crawler/waf.go` * `internal/db/queue.go` These files were ignored: * `internal/crawler/probe_test.go` * `internal/crawler/waf_test.go` * `internal/db/queue_test.go` * `internal/jobs/block_job_test.go` * `internal/jobs/fail_job_message_test.go` * `internal/jobs/sitemap_terminal_guard_test.go` * `internal/jobs/waf_circuit_breaker_dispatch_test.go` These file types are not supported: * `CHANGELOG.md`
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-368.fly.dev |
Summary
All five CodeRabbit follow-ups to #367 (WAF detection, row 1 of #365), in one
PR per request. Each addresses a finding that would only bite under a specific
failure mode; none change the contract working in production today.
Three small ones (commit 1)
runWAFPreflight): ifBlockJob's DB write errs, transition the job tofailedvia a status-guarded write rather than returning to the caller with the job still inpendingand no tasks.JobStatusBlockedterminal cleanup (manager.goUpdateJobStatus): addsJobStatusBlockedto the case-arm that dropsprocessedPagesand milestone state, matching the other terminal statuses.normaliseProbeTarget):HTTPS://example.comno longer double-prefixes tohttps://HTTPS://....Two meatier ones (commit 2)
BlockJobCAS guard (manager.goBlockJob): the terminalUPDATE jobsnow restricts to pre-terminal statuses and rolls the whole tx back if zero rows match. Stops a staleGetJobpre-read from overwriting a freshly-completed/failed/cancelled job and stampingdomains.waf_blockedoff a verdict that didn't land. Race-lost surfaces as nil success, not a red error.MaybeTripFromOutcome):BlockJobnow runs in a detached goroutine with a 30 s timeout so the stream worker hot path can't stall on terminal-state DB lock contention. OnBlockJobfailure the breaker re-arms for the job — a transient DB blip no longer permanently disables it.Tests added
TestNormaliseProbeTarget— table-driven, covers mixed-case schemes.TestFailJobWithMessage_StatusGuard— sqlmock, asserts the fallback SQL has the four-status WHERE guard.TestBlockJob_RaceLostReturnsNil— sqlmock, asserts the CAS miss rolls the tx back and surfaces as nil success (no domain stamp, no outbox delete).TestBlockJob_LockOrder— updated to assert the new SQL signature including the CAS clause.TestMaybeTripFromOutcome_AsyncDispatch— holds BlockJob inside a stub via a channel; verifies the caller returns promptly.TestMaybeTripFromOutcome_RearmAfterFailure— BlockJob returns error; subsequentObservefor the same job must trip again.TestMaybeTripFromOutcome_NoRearmOnSuccess— BlockJob returns nil; subsequentObservemust NOT trip again (single-fire preserved).Test plan
go test ./...green (verified locally).target.com.au/woolworths.com.aublocked behaviour unchanged.amazon.comhealthy behaviour unchanged.Summary by CodeRabbit
Bug Fixes
Behavior Changes